Abstract: Data clustering is a challenging problem because of the complex and heterogeneous nature of multidimensional data. Few clustering methods handle multidimensional datasets well, and the difficulty grows further with large volumes of data. For datasets too large to fit on a single disk, parallelism is a natural choice. MapReduce is a programming framework for processing large-scale data in a massively parallel way. We use the DBSCAN algorithm to form clusters and test our tool on synthetic and real datasets obtained from the UCI repository. We adopt a fast partitioning strategy for large-scale non-indexed data, study the cost of merging bordering partitions, and improve upon it. Finally, we evaluate our work on real large-scale datasets on the Hadoop platform. The results show that our approach achieves efficient speedup and scale-up.

Keywords: Data clustering, MapReduce, DBSCAN, Hadoop.